ANSI and Unicode


ANSI and Code Pages

ANSI is usually a single-byte encoding in which 256 character codes (0-255) define all available characters for a language. For many languages a table of 256 characters can hold all the characters needed; languages with much larger character sets, such as Chinese, Japanese and Korean, use double-byte code pages instead. Windows uses different character tables (code pages) for different language groups. For example, in code page 1251 the character codes 0-127 point to the normal English and non-printable characters, while the character codes 128-255 point mostly to the Cyrillic characters used in Russian and Ukrainian.

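In the Windows ANSI code pages, the first 128 character codes (0-127) are common to all code pages and contain the non-printable and English language characters. The extended character codes (128-255) point to different characters in different code pages, so the same byte can display as a different letter depending on which code page is used to read it.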
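
As a small illustration of the point above, the following Python sketch decodes the same two bytes with two Windows code pages. The byte values and the choice of code page 1252 (Western European) for comparison are picked only for this example.

```python
# The same bytes decode to different characters under different code pages.
# 0x41 lies in the shared 0-127 range; 0xC0 lies in the extended 128-255 range.
raw = bytes([0x41, 0xC0])

cyrillic = raw.decode("cp1251")  # Windows Cyrillic code page
western = raw.decode("cp1252")   # Windows Western European code page

# Print code points rather than glyphs so the output does not depend
# on what the console is able to display.
print([f"U+{ord(c):04X}" for c in cyrillic])  # ['U+0041', 'U+0410']
print([f"U+{ord(c):04X}" for c in western])   # ['U+0041', 'U+00C0']
```

The first byte decodes to the English letter A in both cases. The second byte becomes the Cyrillic capital A (U+0410) under code page 1251 and the Latin capital A with grave accent (U+00C0) under code page 1252.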

Unicode

In Unicode, a letter maps to something called a code point. Every letter in every alphabet is assigned a magic number by the Unicode Consortium, written like this: U+0645. This magic number is called a code point. The U+ means "Unicode" and the digits are hexadecimal. The English letter A is then written U+0041. The letters of most languages in current use fall in the range U+0000 to U+FFFF and can be represented by a single 2-byte value in UTF-16, the encoding Windows uses internally. Code points above U+FFFF are stored in UTF-16 as a pair of 2-byte values (a surrogate pair). Unicode code points run up to U+10FFFF, which leaves room for more than a million characters.
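
The relationship between a character, its code point and its UTF-16 storage can be checked with a short Python sketch. The emoji code point U+1F600 is chosen here only as an example of a character outside the 2-byte range.

```python
# A character and its code point are two views of the same thing.
meem = "\u0645"  # the U+0645 example from the text (Arabic letter meem)
print(f"U+{ord(meem):04X}")  # U+0645
print(f"U+{ord('A'):04X}")   # U+0041

# In UTF-16 (little endian, no BOM), a code point up to U+FFFF takes 2 bytes.
a_utf16 = "A".encode("utf-16-le")
print(a_utf16.hex(" "))  # 41 00

# A code point above U+FFFF is stored as two 2-byte values (a surrogate pair).
face_utf16 = "\U0001F600".encode("utf-16-le")
print(face_utf16.hex(" "))  # 3d d8 00 de
```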

UTF-8

UTF-8 is another system for storing your string of Unicode code points, using 8-bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2 to 4 bytes (the original design allowed sequences of up to 6 bytes, but the current standard limits UTF-8 to 4). This has the effect that English text saved as UTF-8 looks exactly the same as in ANSI using character codes 0-127. Compared with UTF-16, this can save some space in memory or on disk if the text contains a lot of English characters.
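
A quick way to see the variable length of UTF-8 is to encode a few characters and count the bytes. The sample characters in this Python sketch are arbitrary choices for illustration.

```python
# Encode a handful of code points as UTF-8 and show how many bytes each needs.
samples = ["A", "\u00e9", "\u0645", "\u20ac", "\U0001F600"]

lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths[f"U+{ord(ch):04X}"] = len(encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# Plain English text is byte-for-byte identical in UTF-8 and in an ANSI code page.
same = "Hello".encode("utf-8") == "Hello".encode("cp1252")
print(same)  # True
```

Only the first sample (the letter A) fits in one byte; the others need two, two, three and four bytes respectively.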

Unicode Files

Windows recognizes a Unicode file primarily by its file signature. This signature is called a Unicode Byte Order Mark (BOM). There are 3 different encodings that are commonly used when saving Unicode text.

UTF-16 Little Endian. Used by Windows operating systems, where it is typically just called "Unicode".
BOM = 2 bytes: 0xFF 0xFE
followed by 2-byte values, low byte first: xx 00 xx 00 xx 00 for normal 0-127 ASCII characters.

UTF-16 Big Endian. Traditionally associated with big-endian platforms such as older Macintosh systems.
BOM = 2 bytes: 0xFE 0xFF
followed by 2-byte values, high byte first: 00 xx 00 xx 00 xx for normal 0-127 ASCII characters.
That is, the same as Windows UTF-16 Little Endian, but with the two bytes of each value swapped.

UTF-8
BOM = 3 bytes: 0xEF 0xBB 0xBF
followed by 1 to 4 bytes per character: a single byte xx for each normal 0-127 ASCII character.
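
The three signatures listed above can be checked with a few lines of Python. This is a minimal sketch, not how Windows itself performs the detection: the function name detect_bom is made up for this example, and it handles only the three BOMs described here.

```python
import codecs

def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no known BOM is found.

    Hypothetical helper for illustration. It checks only the three signatures
    described in the text. A UTF-32 little-endian file also starts with
    0xFF 0xFE and would be misreported as UTF-16 by this sketch.
    """
    signatures = [
        (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
        (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
        (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    ]
    for bom, name in signatures:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# Usage: build a small UTF-16 little-endian "file" in memory and read it back.
data = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
name, skip = detect_bom(data)
text = data[skip:].decode(name)
print(name, text)  # utf-16-le Hi
```

A file without any of these signatures returns (None, 0), and the caller then has to fall back on some other rule, such as assuming an ANSI code page.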